Defeating the Homogeneity Assumption
Abstract
The statistical NLP and IR literatures tend to make a “homogeneity assumption” about the distribution of terms, either by adopting a “bag of words” model, or in their treatment of function words. In this paper we develop a notion of homogeneity detection to a level of statistical significance, and conduct a series of experiments on different datasets, to show that the homogeneity assumption does not generally hold. We show that it also does not hold for function words. Importantly, datasets and document collections are found not to be neutral with respect to the property of homogeneity, even for function words. The homogeneity assumption is defeated substantially even for collections known to contain similar documents, and more drastically for diverse collections. We conclude that it is statistically unreasonable to assume that word distribution within a corpus is homogeneous. Because homogeneity findings differ substantially between different collections, we argue for the use of homogeneity measures as a means of profiling datasets.
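One way to make "homogeneity detection to a level of statistical significance" concrete is a split-half comparison: divide the documents of a collection at random into two halves and test whether the most frequent terms are distributed alike in both. The sketch below uses a standard Pearson chi-square test of homogeneity on a 2 x K table of term counts. It is an illustration under assumptions and not necessarily the procedure used in the paper: the function names, the whitespace tokenisation, and the `top_n` and `seed` parameters are all hypothetical.

```python
import random
from collections import Counter


def term_counts(docs):
    """Count lower-cased, whitespace-separated tokens over a list of documents."""
    return Counter(tok for doc in docs for tok in doc.lower().split())


def chi_square_homogeneity(counts_a, counts_b, terms):
    """Pearson chi-square statistic for a 2 x K table (two halves, K terms).

    Returns (statistic, degrees_of_freedom). Column totals are restricted to
    the selected terms, so the table is a proper contingency table.
    """
    n_a = sum(counts_a[t] for t in terms)
    n_b = sum(counts_b[t] for t in terms)
    n = n_a + n_b
    if n == 0:
        return 0.0, 0
    stat = 0.0
    for t in terms:
        row_total = counts_a[t] + counts_b[t]
        for observed, col_total in ((counts_a[t], n_a), (counts_b[t], n_b)):
            expected = row_total * col_total / n
            if expected > 0:  # skip empty cells to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat, max(len(terms) - 1, 0)


def split_half_statistic(docs, top_n=100, seed=0):
    """Randomly split docs into two halves and compare their top_n terms."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    counts_a = term_counts(shuffled[:mid])
    counts_b = term_counts(shuffled[mid:])
    # Select the most frequent terms of the whole collection.
    terms = [t for t, _ in (counts_a + counts_b).most_common(top_n)]
    return chi_square_homogeneity(counts_a, counts_b, terms)
```

Under the homogeneity assumption the statistic should stay near its degrees of freedom. A value above the chi-square critical value for those degrees of freedom (for example 3.84 at the 0.05 level with one degree of freedom) indicates that the two halves differ significantly. Repeating the split with different seeds gives a rough picture of how stable the result is for a given collection.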
Similar resources
The Effectiveness of an Acceptance and Commitment Training Program on Motivational Beliefs and Future Time Perspective for Students with Academic Self-Defeating Behaviors
Abstract: Introduction: The purpose of this study was to examine the effectiveness of a training program based on the acceptance and commitment approach on motivational beliefs and future time perspective for students with academic self-defeating behaviors at Vali-e-Asr University of Rafsanjan. Method: The research used a semi-experimental pre-test and post-test design with a control group. ...
Develop an educational package on perceptions of school climate and its Feasibility on self-defeating academic behaviors of male students
The aim of this study was to develop an educational package on the perception of the school environment and its feasibility study on the self-defeating academic behaviors of male students. The research method was quasi-experimental with pre-test and post-test design with a control group and quarterly follow-up. The statistical population included all students studying in the second year of high...
Remarks on the Frisch framework of hydrodynamic turbulence and the quasi-Lagrangian formulation
In this paper, we revisit the claim that the Eulerian and quasi-Lagrangian same time correlation tensors are equal. This statement allows us to transform the results of an MSR quasi-Lagrangian statistical theory of hydrodynamic turbulence back to the Eulerian representation. We define a hierarchy of homogeneity symmetries between the local homogeneity of Frisch and global homogeneity. It is sho...
On the elimination of the sweeping interactions from theories of hydrodynamic turbulence
In this paper, we revisit the claim that the Eulerian and quasi-Lagrangian same time correlation tensors are equal. This statement allows us to transform the results of an MSR quasi-Lagrangian statistical theory of hydrodynamic turbulence back to the Eulerian representation. We define a hierarchy of homogeneity symmetries between the local homogeneity of Frisch and global homogeneity. It is sho...
Global Nonlinear Brascamp–Lieb Inequalities
We prove global versions of certain known nonlinear Brascamp–Lieb inequalities under a natural homogeneity assumption. We also establish a conditional theorem allowing one to generally pass from local to global nonlinear Brascamp–Lieb estimates under such a homogeneity assumption.